Fix (all?) parallel hyperopt issues #2415
Merged
scarlehoff merged 7 commits into master, Jan 29, 2026
Conversation
scarlehoff (Member, Author) commented:
This is now ready. The only point I haven't really been able to fix is the creation of the cache hashes in the working directory. One solution is to run each worker in a separate node, with only the one holding the database running locally, but that is a cluster-dependent solution. Btw, in principle you can run in parallel in different folders, nodes or computers, and the […]
Member
Radonirinaunimi
left a comment
Sorry for the delay. Here are some comments.
n3fit run: the server is now started only if a connection to a database fails; `--restart` is removed, every run now restarts
Co-authored-by: Tanjona R. Rabemananjara <rrabeman@nikhef.nl>
Radonirinaunimi
approved these changes
Jan 29, 2026
With this PR we can now submit to multiple nodes in parallel (technically even to multiple clusters if you are feeling lucky).
The memory footprint of mongodb itself is much smaller as well.
Now an instance of `n3fit` can only spawn one single mongo worker, so it doesn't matter whether they are running in the same node or not: every worker connects to the database in the same manner. The "number of mongo workers" option has been removed.

Now when a parallel hyperopt is about to run, each worker checks whether there is a database already running; if there is none, it starts the database and writes down its address. All other instances of `n3fit` will find the address and try to connect.

There is also no longer a `--restart` option. Every new run is always a restart. If you don't want to continue a previous run, just change the name of the runcard; overwriting previous runs is very impolite.

Also, the database is no longer written to a separate folder and then compressed only when finishing cleanly. Now the database always lives in the `nnfit` folder. It will be compressed at `vp-upload` time, but not before (this matters because it is so big), and only if the `--upload-db` option is used; otherwise the database is skipped during the upload. This makes hyperopt runs much less heavy on the nnpdf server.

To do:
- Check that `mongodb` works well, in particular automatically finding in which node the database is running. Seems to work, but maybe a lockfile is needed to ensure two databases don't start at once.
- Remove the hash files in the working directory. Not sure whether this is possible, but it is quite ugly. Can't find a way to do this 🤷‍♂️
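The discovery protocol described above (each worker checks for a running database, and if there is none, starts one and advertises its address), combined with the lockfile idea from the to-do list, could be sketched roughly as below. This is only an illustration, not the actual `n3fit` implementation: the file names (`mongo_address.txt`, `mongo.lock`), the function names, and the error handling are all hypothetical.

```python
import os
import socket
from pathlib import Path


def server_answers(address, timeout=1.0):
    """Return True if something accepts TCP connections at "host:port"."""
    host, port = address.rsplit(":", 1)
    try:
        with socket.create_connection((host, int(port)), timeout=timeout):
            return True
    except OSError:
        return False


def find_or_start_database(workdir, start_server):
    """Connect to a database advertised in ``workdir``, or start one.

    ``start_server`` must launch the database and return its "host:port".
    Only one worker can win the lockfile race; the others should either
    connect to the advertised address or back off and retry.
    """
    workdir = Path(workdir)
    address_file = workdir / "mongo_address.txt"  # hypothetical filename
    lock_file = workdir / "mongo.lock"            # hypothetical filename

    # If some worker already advertised an address and it answers, use it
    if address_file.exists():
        address = address_file.read_text().strip()
        if server_answers(address):
            return address

    # O_CREAT | O_EXCL is atomic: exactly one worker can create the lock,
    # which prevents two databases from being started at once
    try:
        fd = os.open(lock_file, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        os.close(fd)
    except FileExistsError:
        raise RuntimeError("another worker is already starting the database")

    address = start_server()
    address_file.write_text(address)  # advertise for the other workers
    return address
```

In a real multi-node setting the working directory would have to live on a shared filesystem for the address file and lockfile to be visible to all workers, which is presumably why the solution is described as cluster-dependent.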